Data Mining: A Review of Regression Models

Introduction to Regression Models¶

  • Used when the dependent variable (Y) is numeric (float/real) and the independent variables may be numeric and/or categorical

Starting from the Center of the Data and the Variance¶

$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}{x_i}$ and $s^2=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})^2}}{N-1}$

  • Note the meaning of the variance formula, then compare it with the following "covariance" formula:

From Variance to Covariance: Measuring the Linear Relationship between 2 Variables¶

  • How does it work? (Statistical thinking)
  • The concept: to "co-vary" is to vary away from the mean together.
  • Use "reverse" thinking to understand it.
  • Usage: cov(x,y) = 2 vs. cov(x,y) = -2 vs. cov(x,y) = 0
  • Covariance = 3000? What does that mean?
$cov(x,y)=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})(y_i-\bar{y})}}{N-1}$
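For a quick numeric check, a minimal sketch (with made-up vectors, chosen so that cov(x, y) = 2 as in the first usage example above):

import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
# Sample covariance by hand: average co-deviation from the two means
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov_xy, np.cov(x, y, ddof=1)[0, 1])  # both print 2.0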

From Covariance to Correlation: Statistical Thinking¶

  • Correlation is simply the covariance divided by the two standard deviations.
  • What does that mean?
  • Correlation has a geometric meaning .... it is a cosine!...

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation

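Both bullets can be verified directly: the correlation is the covariance rescaled by the two standard deviations, and it equals the cosine of the angle between the mean-centered vectors (a sketch with the same made-up vectors as above):

import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
# Correlation = covariance divided by both standard deviations
r = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
# Geometric view: cosine of the angle between the mean-centered vectors
xc, yc = x - x.mean(), y - y.mean()
cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(r, cos_angle, np.corrcoef(x, y)[0, 1])  # all three agree (~0.853)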

Values of the "Pearson" (Linear) Correlation Coefficient¶

  • The Pearson correlation coefficient ranges from -1 to +1

Be Careful¶

  • A correlation coefficient of 0 does not mean there is no relationship between the two variables. Correctly stated: there is no LINEAR relationship, but there may well be a relationship of another form, e.g., quadratic or some other non-linear function, as in the example above.
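A quick numeric illustration of this warning (synthetic data, for illustration only): a perfect quadratic relationship whose Pearson correlation is essentially zero:

import numpy as np
x = np.linspace(-3, 3, 61)
y = x**2  # perfect (non-linear) relationship
print(np.corrcoef(x, y)[0, 1])  # ~0: no LINEAR relationship, yet y is fully determined by x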

Correlation and Causation¶

  • Everyone who drinks water dies

Qualitative Assessments of Correlation Values like These? ... Really? Why? Why Not?¶

  • Image source: https://spencermath.weebly.com/home/interpreting-the-correlation-coefficient
  • Cases (social, medicine, etc.)
  • Objectives: prediction vs. insights

A Simple Example Case¶

In [1]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
import statsmodels.api as sm, scipy.stats as stats
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
from sklearn.datasets import load_boston
from sklearn.preprocessing import MinMaxScaler
plt.style.use('bmh'); sns.set()
"Done"
Out[1]:
'Done'
In [2]:
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138, 142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
Out[2]:
usia tekanan_darah
0 40 126
1 45 124
2 50 135
3 53 138
4 60 142
5 65 139
6 69 140
7 71 151
In [3]:
# Correlation and scatter plot to inspect the data
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance =  76.953125
Correlations =  [[1.         0.88746015]
 [0.88746015 1.        ]]
In [4]:
# Better
print(df.corr())
sns.heatmap(df.corr(),cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,annot=True, annot_kws={"size": 16}, square=True)
p = sns.pairplot(df)
                  usia  tekanan_darah
usia           1.00000        0.88746
tekanan_darah  0.88746        1.00000

Interpretation¶

  • The value of ~0.89 indicates a strong positive linear correlation between age and blood pressure: there is a tendency for higher ages to go with higher blood pressure than lower ages.


  • Other examples from machine-learning research (beauty and confidence / finger length and IQ)

WARNING¶

  • Correlation does not equal (imply) causation. Note the interpretation above: it does not claim that higher age causes higher blood pressure, only that there is a trend or tendency. It may be that blood pressure rises with age, but it may also be that high blood pressure is driven not by age but by other factors not observed in the data.

  • At this point we know the two variables are related, but correlation alone cannot tell us what the relationship looks like. That is why we need a regression model.

Simple Linear Regression¶


From Correlation to Regression¶


How Do We Compute the Optimal Regression Parameters?¶

  • Why does the formula look the way it does?
  • The importance of understanding the "loss function" (see the sketch below)
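Minimizing the squared-error loss gives the classic closed form for simple linear regression; a minimal sketch on the blood-pressure data above (assuming df from In [2] is still in memory):

import numpy as np
x, y = df['usia'].values, df['tekanan_darah'].values
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # ~98.56 and ~0.68, matching the OLS fit below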

Error Evaluation (Mean Squared Error)¶

$MSE=\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}$
  • Careful ... look at the formula closely ... it is not robust to outliers
  • $\hat{y}=\beta_0+\beta_1x_1+...+\beta_px_p$
  • MSE = the average squared difference between the predictions and the actual (observed) values
  • RMSE = $\sqrt{MSE}$ ... why? (see the sketch below)
  • This evaluation matters whenever we want to make predictions
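A small sketch of both metrics on the same data (it recomputes the closed-form fit, so it runs on its own after In [2]):

import numpy as np
x, y = df['usia'].values, df['tekanan_darah'].values
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_pred = b0 + b1 * x
mse = np.mean((y - y_pred)**2)  # squaring lets large errors (outliers) dominate
rmse = np.sqrt(mse)             # back in the units of y, hence easier to read
print('MSE =', mse, '| RMSE =', rmse)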
In [5]:
# Fit a simple regression model
lm = smf.ols("tekanan_darah ~ usia", data=df[['tekanan_darah','usia']]).fit()
lm.summary()
# 1. F-statistic
# 2. Tests of the model coefficients
# 3. R^2
# 4. Model interpretation
# 5. Durbin-Watson ==> time series?
Out[5]:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          tekanan_darah   R-squared:                       0.788
Model:                            OLS   Adj. R-squared:                  0.752
Method:                 Least Squares   F-statistic:                     22.25
Date:                Wed, 18 Sep 2024   Prob (F-statistic):            0.00327
Time:                        14:37:47   Log-Likelihood:                -21.920
No. Observations:                   8   AIC:                             47.84
Df Residuals:                       6   BIC:                             48.00
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     98.5623      8.266     11.924      0.000      78.337     118.788
usia           0.6766      0.143      4.717      0.003       0.326       1.028
==============================================================================
Omnibus:                        3.192   Durbin-Watson:                   2.005
Prob(Omnibus):                  0.203   Jarque-Bera (JB):                1.016
Skew:                          -0.340   Prob(JB):                        0.602
Kurtosis:                       1.392   Cond. No.                         311.
==============================================================================


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [8]:
# Plot the Data
p = sns.regplot(data=df, x="usia", y="tekanan_darah")

Evaluating $R^2$: Model vs. No Model?¶

  • $SSR=SST-SSE=\sum{(y_i-\bar{y})^2}-\sum{(y_i-\hat{y}_i)^2}$
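A small sketch recomputing $R^2$ by hand for the simple model lm fitted above:

import numpy as np
y = df['tekanan_darah'].values
y_hat = lm.fittedvalues.values
sst = np.sum((y - y.mean())**2)  # baseline "model": always predict the mean
sse = np.sum((y - y_hat)**2)     # errors of the fitted regression
print(1 - sse/sst, lm.rsquared)  # both ~0.788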

Adjusted R-Squared? Why?¶


The Effect of the Dependent Variable on the Model¶


All Models Are Wrong¶

  • A perfect/true best model does not exist, and often is not even needed

Understand the Regression Assumptions Well¶

https://taudata.blogspot.com/2019/04/asumsi-statistik-antara-benci-butuh.html¶

  • image source: https://www.slideshare.net/mahakvijay3/basics-of-regression-analysis
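As a concrete companion to the links above, a minimal sketch of three common assumption checks (assuming the simple model lm fitted earlier is still in memory):

import scipy.stats as stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
resid = lm.resid
# 1. Normality of residuals (H0: residuals are normally distributed)
print('Shapiro-Wilk :', stats.shapiro(resid))
# 2. Homoscedasticity (H0: constant error variance)
print('Breusch-Pagan:', het_breuschpagan(resid, lm.model.exog))
# 3. Independence of errors (values near 2 suggest little autocorrelation)
print('Durbin-Watson:', durbin_watson(resid))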

Non-Linear Regression¶

  • Why?
  • When is adding model complexity not recommended?
  • Regression for insights vs. regression for prediction.
  • Still linear in the parameters
  • image source: https://sites.google.com/site/apphysics1online/appendices/2-data-analysis/graph-linearization
In [9]:
# Load the sample dataset from the module
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head(), df.shape, set(df['Region'])
Out[9]:
(   Lottery  Literacy  Wealth Region
 0       41        37      73      E
 1       38        51      22      N
 2       66        13      61      C
 3       80        46      76      E
 4       79        69      83      E,
 (85, 4),
 {'C', 'E', 'N', 'S', 'W'})
In [10]:
# Treat "Region" as a categorical variable
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
print(res.summary())
Intercept         38.651655
C(Region)[T.E]   -15.427785
C(Region)[T.N]   -10.016961
C(Region)[T.S]    -4.548257
C(Region)[T.W]   -10.091276
Literacy          -0.185819
Wealth             0.451475
dtype: float64
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Wed, 18 Sep 2024   Prob (F-statistic):           1.07e-05
Time:                        14:38:58   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         38.6517      9.456      4.087      0.000      19.826      57.478
C(Region)[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
C(Region)[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
C(Region)[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
C(Region)[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
Literacy          -0.1858      0.210     -0.886      0.378      -0.603       0.232
Wealth             0.4515      0.103      4.390      0.000       0.247       0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [11]:
# Non-linear transformation
res = smf.ols(formula='Lottery ~ np.log(Literacy) + Wealth -1', data=df).fit()
print(res.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                Lottery   R-squared (uncentered):                   0.799
Model:                            OLS   Adj. R-squared (uncentered):              0.794
Method:                 Least Squares   F-statistic:                              165.2
Date:                Wed, 18 Sep 2024   Prob (F-statistic):                    1.16e-29
Time:                        14:38:58   Log-Likelihood:                         -384.16
No. Observations:                  85   AIC:                                      772.3
Df Residuals:                      83   BIC:                                      777.2
Df Model:                           2                                                  
Covariance Type:            nonrobust                                                  
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
np.log(Literacy)     4.6426      1.246      3.727      0.000       2.165       7.120
Wealth               0.5853      0.089      6.571      0.000       0.408       0.762
==============================================================================
Omnibus:                        4.188   Durbin-Watson:                   1.892
Prob(Omnibus):                  0.123   Jarque-Bera (JB):                4.034
Skew:                          -0.480   Prob(JB):                        0.133
Kurtosis:                       2.533   Cond. No.                         25.8
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Case Study (Boston House Prices) - Another Property Case Study¶

  • Source: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
In [12]:
# Loading Data
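# Note: load_boston was removed in scikit-learn 1.2, so this cell
# (and the import in In [1]) requires scikit-learn < 1.2.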
boston = load_boston()
# Convert to a pandas DataFrame
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
bos.head()
Out[12]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [13]:
# Dataset description
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [14]:
bos.describe(include='all')
Out[14]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
In [15]:
p = sns.pairplot(bos)

Checking Correlations between Predictors¶

In [16]:
# Heatmap to investigate the correlations
corr2 = bos.corr() # correlations among all variables, including PRICE
plt.figure(figsize=(12, 10))

sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)], 
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);
In [17]:
m = ols('PRICE ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m.summary())
# Don't forget to analyze and interpret the results
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.679
Model:                            OLS   Adj. R-squared:                  0.677
Method:                 Least Squares   F-statistic:                     353.3
Date:                Wed, 18 Sep 2024   Prob (F-statistic):          2.69e-123
Time:                        14:39:30   Log-Likelihood:                -1553.0
No. Observations:                 506   AIC:                             3114.
Df Residuals:                     502   BIC:                             3131.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     18.5671      3.913      4.745      0.000      10.879      26.255
RM             4.5154      0.426     10.603      0.000       3.679       5.352
PTRATIO       -0.9307      0.118     -7.911      0.000      -1.162      -0.700
LSTAT         -0.5718      0.042    -13.540      0.000      -0.655      -0.489
==============================================================================
Omnibus:                      202.072   Durbin-Watson:                   0.901
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1022.153
Skew:                           1.700   Prob(JB):                    1.10e-222
Kurtosis:                       9.076   Cond. No.                         402.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [18]:
m2 = ols('np.log(PRICE) ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(PRICE)   R-squared:                       0.714
Model:                            OLS   Adj. R-squared:                  0.713
Method:                 Least Squares   F-statistic:                     418.4
Date:                Wed, 18 Sep 2024   Prob (F-statistic):          3.96e-136
Time:                        14:39:30   Log-Likelihood:                 52.201
No. Observations:                 506   AIC:                            -96.40
Df Residuals:                     502   BIC:                            -79.50
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.5469      0.164     21.632      0.000       3.225       3.869
RM             0.1044      0.018      5.849      0.000       0.069       0.139
PTRATIO       -0.0391      0.005     -7.927      0.000      -0.049      -0.029
LSTAT         -0.0353      0.002    -19.974      0.000      -0.039      -0.032
==============================================================================
Omnibus:                       44.245   Durbin-Watson:                   0.916
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              179.110
Skew:                           0.246   Prob(JB):                     1.28e-39
Kurtosis:                       5.873   Cond. No.                         402.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [19]:
# Define the figure size
fig = plt.figure(figsize=(12,8))
# Produce regression plots
fig = sm.graphics.plot_regress_exog(m,'RM', fig=fig)
In [20]:
plt.rc("figure", figsize=(16,12))
plt.rc("font", size=14)
fig = sm.graphics.plot_fit(m, "RM")
fig.tight_layout(pad=1.0)
In [28]:
model_fitted_y = m.fittedvalues # fitted values
model_residuals = m.resid # raw residuals
model_norm_residuals = m.get_influence().resid_studentized_internal # studentized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals)) # sqrt of absolute studentized residuals
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = m.get_influence().hat_matrix_diag
# Cook's distance, from statsmodels internals
model_cooks = m.get_influence().cooks_distance[0]
bos["model_fitted_y"] = model_fitted_y
p = sns.residplot(data=bos, x="model_fitted_y", y="PRICE", lowess=True, line_kws=dict(color="r"))

Variable Selection: Stepwise Methods in Regression Analysis¶

  • image source: https://quantifyinghealth.com/stepwise-selection/
  • image source: https://en.wikipedia.org/wiki/Stepwise_regression
  • Cautions: https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df
In [29]:
def forward_selected(data, response):
    """Linear model designed by forward selection.
    https://planspace.org/20150423-forward_selection_with_statsmodels/
    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model
In [30]:
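# Note: bos still carries the model_fitted_y column added for the residual
# plot above, so the selection below picks it up (leakage from the earlier fit);
# drop it first, e.g. bos.drop(columns=['model_fitted_y']), for a clean run.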
model = forward_selected(bos, 'PRICE')
print(model.model.formula)
print(model.rsquared_adj)
PRICE ~ model_fitted_y + DIS + NOX + CHAS + B + ZN + CRIM + RAD + TAX + 1
0.7353524066872489
In [31]:
# How do we interpret the coefficients?
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  PRICE   R-squared:                       0.740
Model:                            OLS   Adj. R-squared:                  0.735
Method:                 Least Squares   F-statistic:                     156.9
Date:                Wed, 18 Sep 2024   Prob (F-statistic):          5.73e-139
Time:                        14:43:46   Log-Likelihood:                -1499.4
No. Observations:                 506   AIC:                             3019.
Df Residuals:                     496   BIC:                             3061.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         15.4968      2.851      5.436      0.000       9.896      21.098
model_fitted_y     0.8982      0.037     24.483      0.000       0.826       0.970
DIS               -1.4816      0.183     -8.096      0.000      -1.841      -1.122
NOX              -16.4303      3.300     -4.979      0.000     -22.914      -9.947
CHAS               2.7559      0.852      3.233      0.001       1.081       4.430
B                  0.0093      0.003      3.534      0.000       0.004       0.015
ZN                 0.0484      0.013      3.743      0.000       0.023       0.074
CRIM              -0.1074      0.033     -3.298      0.001      -0.171      -0.043
RAD                0.2863      0.062      4.626      0.000       0.165       0.408
TAX               -0.0120      0.003     -3.573      0.000      -0.019      -0.005
==============================================================================
Omnibus:                      182.457   Durbin-Watson:                   1.068
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              846.346
Skew:                           1.543   Prob(JB):                    1.65e-184
Kurtosis:                       8.534   Cond. No.                     1.08e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Compare the Durbin-Watson of the three-predictor PRICE model above with the Durbin-Watson of this stepwise model, and comment on the Jarque-Bera statistics.

Data Scaling "for Insights"¶

  • The importance of "scaling" in regression (or clustering) when mining the data for insight (a sketch follows the cell below)
  • image source: https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
In [32]:
scaler = MinMaxScaler()
bos[['TAX', 'AGE', 'B']] = scaler.fit_transform(bos[['TAX', 'AGE', 'B']])
bos.head()
# Continue to Modelling
Out[32]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE model_fitted_y
0 0.00632 18.0 2.31 0.0 0.538 6.575 0.641607 4.0900 1.0 0.208015 15.3 1.000000 4.98 24.0 31.168357
1 0.02731 0.0 7.07 0.0 0.469 6.421 0.782698 4.9671 2.0 0.104962 17.8 1.000000 9.14 21.6 25.767464
2 0.02729 0.0 7.07 0.0 0.469 7.185 0.599382 4.9671 2.0 0.104962 17.8 0.989737 4.03 34.7 32.139173
3 0.03237 0.0 2.18 0.0 0.458 6.998 0.441813 6.0622 3.0 0.066794 18.7 0.994276 2.94 33.4 31.080407
4 0.06905 0.0 2.18 0.0 0.458 7.147 0.528321 6.0622 3.0 0.066794 18.7 1.000000 5.33 36.2 30.386589
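As a follow-up sketch (illustrative; it reuses scaler and ols from the cells above): scale every predictor to 0-1, refit the three-predictor model, and compare coefficient magnitudes rather than just signs:

bos_s = bos.drop(columns=['model_fitted_y'])
pred_cols = bos_s.columns.drop('PRICE')
bos_s[pred_cols] = scaler.fit_transform(bos_s[pred_cols])
m_scaled = ols('PRICE ~ RM + PTRATIO + LSTAT', bos_s).fit()
print(m_scaled.params.sort_values())  # magnitudes are now comparable across predictors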

Pitfalls: Regression is Interpolation, "not" Extrapolation (Forecasting)¶

  • image source: https://www.datasciencecentral.com/forum/topics/what-are-the-differences-between-prediction-extrapolation-and

Not yet covered:

  1. Logistic Regression [to be covered in the Classification topic]
  2. Piecewise Regression (non-linear)
  3. Probit/Tobit Regression (probabilistic)
  4. Bayesian Regression
  5. Logic Regression (more robust than logistic regression for fraud detection)
  6. Quantile Regression (extreme events)
  7. LAD Regression (L1)
  8. Jackknife Regression
  9. SVR
  10. ARIMA (time series)
  11. Ecologic Regression

image Source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html¶

Exercise: Advertising Spend Investment Case Study¶

In [33]:
# Example
# Load the CSV data file
try:
    df = pd.read_csv('data/iklan.csv') # run locally
except:
    !wget https://raw.githubusercontent.com/taudataanalytics/Data-Mining--Penambangan-Data--Ganjil-2024/master/data/iklan.csv # "Google Colab"
    df = pd.read_csv('iklan.csv') 
df.head()
Out[33]:
No Iklan Laba Tipe
0 1 10 9.17 1
1 2 1 1.32 0
2 3 12 8.54 1
3 4 12 7.68 1
4 5 5 7.15 1
In [34]:
p = sns.pairplot(df, hue="Tipe")
In [35]:
# Do Modelling Here ... Don't forget to interpret.
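One possible starting point (a sketch, not the only valid model; it assumes the Iklan, Laba, and Tipe columns shown above):

m_iklan = ols('Laba ~ Iklan + C(Tipe)', df).fit()  # Tipe treated as categorical
print(m_iklan.summary())  # interpret the coefficients and R^2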

End of Module - Regression Analysis Review¶

